I am a PhD student in linguistics with some previous experience in R and RStudio as well as basic statistics, but with this course I hope to gain a better understanding of statistical methods and open data that I could use in my research. My GitHub repository.
For the tasks this week I wrangled data into data frames in the form required for analysis. I used the data to analyse the relationship between a dependent and an independent variable by fitting a linear regression model. I also made plots from the residuals to diagnose the appropriateness of the model.
In this analysis we investigate how the points scored in a statistics test by a student are explained by different learning approaches and the student’s attitude, age and gender. To do this, we shall analyse the explanatory power of the different potential explanatory variables by building a regression model where exam points is the dependent variable.
The data for this analysis is a survey dataset on learning approaches in a statistics class. More information on its collection can be found here. The data has 166 observations and 7 variables. Below is a sample of what the dataset used for the purposes of this study looks like.
## gender Age Attitude deep stra surf Points
## 1 F 53 37 3.583333 3.375 2.583333 25
## 2 M 55 31 2.916667 2.750 3.166667 12
## 3 F 49 25 3.500000 3.625 2.250000 24
## 4 M 53 35 3.500000 3.125 2.250000 10
## 5 M 49 37 3.666667 3.625 2.833333 22
As can be seen, observations include the gender and age of the respondents as well as their attitude towards statistics and their exam points (Points). It should be noted that respondents whose score in Points equals 0 have been removed from the current 166-observation dataset. The scores for the variables measuring deep, strategic and surface approaches to learning are means computed over individual questionnaire items: for example, deep represents the mean of the scores for the following variables.
deep <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30","D06", "D15", "D23", "D31")
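A sketch of how such a combined score could be computed, assuming the full questionnaire data is in a data frame named lrn14 (a hypothetical name) and dplyr is installed:

deep_columns <- dplyr::select(lrn14, dplyr::one_of(deep))  # pick the deep-approach items
lrn14$deep <- rowMeans(deep_columns)                       # row-wise mean per respondent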
The following offers an overview visualisation of correlations within the data. It is, for example, noteworthy that there are twice as many women as men. In terms of correlations, at 0.437 the strongest correlation is that between Points and Attitude, noticeable among both genders. The lowest (absolute) correlation is between Attitude and Age. This suggests that attitude towards statistics learning does not depend on the age of the respondent, but attitude does play an important role in the points scored by the respondent. Finally, although low, two more correlations worth noting are the 0.146 correlation between Points and strategic approach, and the -0.144 correlation between Points and surface approach.
In addition, the correlation between surface and deep approach, although minimal among women, is -0.622 among men, suggesting that men who use surface approach are less likely to use deep approach and vice versa. As seen in the figure, there is also some negative correlation between surface approach and Attitude, again more among men, as well as surface and strategic approach.
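A sketch of this overview plot, using ggpairs() from the GGally package and assuming the data frame is named dt as in the model calls below:

library(GGally)
library(ggplot2)
ggpairs(dt, mapping = aes(col = gender, alpha = 0.3))  # scatter plots, densities and correlations by gender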
Based on the overview above, we shall first investigate the potential of a linear model that assumes Points is explained by the variables Attitude, surf (surface approach) and stra (strategic approach). The results of this model are given below:
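A sketch of the fit, with the data frame named dt as in the Call: line of the summary:

model <- lm(Points ~ Attitude + surf + stra, data = dt)
summary(model)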
##
## Call:
## lm(formula = Points ~ Attitude + surf + stra, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## surf -0.58607 0.80138 -0.731 0.46563
## stra 0.85313 0.54159 1.575 0.11716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
According to the above results, the only variable with a p-value < 0.001 is Attitude; the p-values of the variables relating to learning approach are > 0.05. This means that for those variables we fail to reject the null hypothesis. Thus, below we shall fit a new model with only the statistically significant variable Attitude.
The results above suggested that only Attitude can be used to explain the dependent variable Points. In other words, our model is:
Points = α + βAttitude + ε
where
α is the intercept, i.e. the expected Points when Attitude = 0
β is the regression coefficient for Attitude
ε is an unobservable, normally distributed random error term
The results and a plot illustrating this model are given below:
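A sketch of the fit and the plot, again assuming the data frame dt:

model2 <- lm(Points ~ Attitude, data = dt)
summary(model2)
plot(Points ~ Attitude, data = dt)  # scatter plot of the observations
abline(model2)                      # fitted regression line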
##
## Call:
## lm(formula = Points ~ Attitude, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The line is positioned so as to minimise the residuals (the differences between observed and predicted values), but quite a few observations fall far from the line. Indeed, the multiple R-squared is 0.1906, indicating that the model explains only about 19 % of the variance in Points. However, the p-value < 0.001 gives reason to believe the variable is nonetheless statistically significant.
The current model assumes that Points can be explained by a single explanatory variable, Attitude, and that a linear regression model (as opposed to e.g. a polynomial regression model) is sufficient for explaining and predicting the observations. To diagnose the model, we shall plot the residuals. Plotting the residuals against the fitted values, we can see that the model is reasonable.
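A sketch of the diagnostic plots discussed below, assuming the model object model2 from above:

# 1 = Residuals vs Fitted, 2 = Normal Q-Q, 5 = Residuals vs Leverage
par(mfrow = c(2, 2))
plot(model2, which = c(1, 2, 5))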
In the Residuals vs Fitted and Residuals vs Leverage plots, the observations are seemingly evenly distributed and there is no visible pattern that would suggest the need for a polynomial regression model. In the Normal Q-Q plot, most of the observations line up on the same line, with some deviation only among the lowest and the highest values. This indicates the model is adequate for explaining our dependent variable and that we do not need to resort to a more complex model.
In this section I use a logistic regression model to study what variables impact alcohol consumption.
The data for the present study is a dataset on alcohol consumption among secondary school students in two schools in Portugal. More information (and a download) is available here. The variables include information on e.g. the student’s background (family size, education and employment of parents) and behaviour (e.g. amount of free time, health status), as well as performance in the subjects of Maths and Portuguese. In the dataset of this study, the original data has already been modified so that the variables for alcohol consumption on weekdays and on weekends have been combined into a single variable alc_use, which is the mean of the two. Further, a binary column high_use has been created based on alc_use, where high_use = TRUE signifies a mean alcohol consumption greater than 2.
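A sketch of that wrangling step, assuming the joined data is in a data frame named alc with the original columns Dalc (weekday) and Walc (weekend):

library(dplyr)
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)  # mean of weekday and weekend consumption
alc <- mutate(alc, high_use = alc_use > 2)       # TRUE when the mean exceeds 2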
A list of all variables is shown below:
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Several of the variables are numeric, using a Likert-scale type 1-5 range (very low - very high), including the values for alcohol consumption. Other variables are binary yes/no data, such as whether the student is in a romantic relationship.
The purpose of this study is to explore how the variables predict high/low alcohol consumption among the observations. I hypothesise that alcohol consumption is related to the following four variables: sex, study time (studytime), going out with friends (goout) and absences.
The hypotheses above reflect only my current intuitions and are not based on research literature on alcohol consumption or on previous familiarisation with the data. It should also be mentioned that the purpose of this analysis is to study the relationships between the variables: even a strong relationship is not necessarily a sign of causality, let alone a sign of which variable is the cause and which the effect.
The first hypothesis is that male students have higher alcohol consumption than female students. It should be noted that the number of females (198) is around the same as males (184). Below the numbers of high consumption have been given in proportions.
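A sketch of the cross-tabulation, assuming the data frame alc from above:

# proportion of high/low consumption within each sex
prop.table(table(sex = alc$sex, high_use = alc$high_use), margin = 1)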
The figure shows that our hypothesis is supported: a higher percentage of male students than of female students have high alcohol consumption.
Our second hypothesis is that students who study more are less likely to have a high alcohol consumption rate (a negative correlation). Study time is measured on the following 4-point scale, representing study hours per week: 1 = <2 hours, 2 = 2 to 5 hours, 3 = 5 to 10 hours, 4 = >10 hours. In the figure below, the findings are again split across sexes. A boxplot was chosen because, in spite of the values being numbers, Likert-type data is better thought of as ordinal rather than continuous, and averages produced from it are misleading at best.
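A sketch of such a boxplot (assuming the data frame alc):

library(ggplot2)
ggplot(alc, aes(x = high_use, y = studytime, col = sex)) +
  geom_boxplot() +
  ylab("study time")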
The graph indicates that high consumption is indeed more common among students with lower weekly totals of study time. Interestingly, male students have a break in the pattern, with students who study 5-10 hours per week having a noticeably lower number of high consumers than all other groups apart from females studying more than 10 hours. However, our overall hypothesis is still supported.
Here it is hypothesised that students who go out are more likely to have high consumption rates. The frequency of going out with friends is here measured using a numeric 5-step Likert-scale, where 1 = very low, and 5 = very high. Again, we examine this first in boxplots.
It appears that high consumption is more common among students who go out frequently, but this is especially pronounced among male students, whereas the female students who go out the most have a lower number of high consumers than the second-highest female group. Thus, our hypothesis is applicable mainly for male students.
Another assumption is that students with high alcohol consumption are also more likely to have more absences from school. Absences is a numeric variable recording the number of absences. We shall explore it in a boxplot, where the dotted line indicates the mean.
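A sketch of a boxplot with a dotted mean line (the stat_summary idiom is an assumption about how the original figure was drawn):

ggplot(alc, aes(x = high_use, y = absences, col = sex)) +
  geom_boxplot() +
  stat_summary(fun.y = mean, geom = "errorbar",
               aes(ymax = ..y.., ymin = ..y..), linetype = "dotted")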
The means indicate a difference, but this may be because of outliers. Not counting the outliers marked as dots beyond the whiskers, the boxplots above suggest that there may be a relationship between absences and alcohol among male students, but among female students the differences are quite small, the medians being about even and the main differences being visible in the top 50%. Based only on this, we cannot outright claim that this supports our hypothesis.
In this section we shall fit a logistic regression model that could explain and predict high alcohol use in the population. At this stage we shall not remove any of the variables, so our model and its statistical summary are as follows.
high_use ~ sex + studytime + goout + absences
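A sketch of the fit, with the data frame named alc as in the Call: below:

m <- glm(high_use ~ sex + studytime + goout + absences,
         data = alc, family = "binomial")
summary(m)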
##
## Call:
## glm(formula = high_use ~ sex + studytime + goout + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9103 -0.7892 -0.5021 0.7574 2.6021
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.20483 0.59429 -5.393 6.94e-08 ***
## sexM 0.78173 0.26555 2.944 0.003242 **
## studytime -0.42116 0.17058 -2.469 0.013551 *
## goout 0.72677 0.11994 6.059 1.37e-09 ***
## absences 0.07811 0.02244 3.481 0.000499 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 381.31 on 377 degrees of freedom
## AIC: 391.31
##
## Number of Fisher Scoring iterations: 4
Based on the above, our highly statistically significant (p < 0.001) variables are goout and absences, with the variable sex having a statistically significant p-value of < 0.01, and studytime a statistically significant p-value of < 0.05. In other words, we reject the null hypothesis for each of our variables.
The coefficients of the variables are as follows:
## (Intercept) sexM studytime goout absences
## -3.20483157 0.78173290 -0.42116184 0.72676824 0.07810746
As hypothesised, study time is the only explanatory variable with a negative coefficient. Absences has the lowest absolute coefficient. We shall take the exponents of the coefficients to analyse the odds ratios of the variables as well as their confidence intervals.
Odds are calculated as p/(1-p), where p is the probability of the outcome. The odds ratio is the ratio of the odds of two values, thus quantifying their relationship; it also represents the multiplicative change in the odds when the variable increases by one unit. The confidence interval describes the precision of the odds-ratio estimate: the 2.5 % and 97.5 % bounds below delimit a 95 % confidence interval, i.e. a range that contains the true population odds ratio with 95 % confidence.
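A sketch of the computation, assuming the model object m from above:

OR <- exp(coef(m))     # odds ratios
CI <- exp(confint(m))  # 95 % confidence intervals on the odds-ratio scale
cbind(OddsRatio = OR, CI)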
## OddsRatio 2.5 % 97.5 %
## (Intercept) 0.04056573 0.01222314 0.1263579
## sexM 2.18525582 1.30370562 3.7010763
## studytime 0.65628388 0.46510383 0.9099505
## goout 2.06838526 1.64470314 2.6349716
## absences 1.08123884 1.03577383 1.1324143
The odds ratios show the multiplicative increase in the odds of high alcohol consumption for each one-unit increase in study time, going out or absences. For sex, the ratio indicates that the odds of a man having high alcohol consumption are about 2.19 times those of a woman. Ranked from highest to lowest odds ratio, i.e. the largest change in odds per unit, sex is at the top and study time at the bottom. However, the confidence intervals are fairly wide, meaning that a large range is needed to be 95 % confident about the total population.
Finally, we shall evaluate the prediction accuracy of our logistic regression model. That is, if the model was used to predict each observation, with what frequency would it be able to correctly categorize an observation as having either high or low alcohol consumption rate. In this case the accuracy of our model is as follows (in n of observations and proportions):
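A sketch of the prediction step, assuming the model m and data frame alc:

alc$probability <- predict(m, type = "response")  # predicted probability of high use
alc$prediction <- alc$probability > 0.5           # classify as high use when p > 0.5
table(high_use = alc$high_use, prediction = alc$prediction)
addmargins(prop.table(table(high_use = alc$high_use, prediction = alc$prediction)))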
## prediction
## high_use FALSE TRUE
## FALSE 252 16
## TRUE 65 49
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65968586 0.04188482 0.70157068
## TRUE 0.17015707 0.12827225 0.29842932
## Sum 0.82984293 0.17015707 1.00000000
The so-called confusion matrix above (with rounded proportions) indicates that 66.0 % (N = 252) of the observations have a low consumption rate and are predicted as such by the model. Correspondingly, 12.8 % (N = 49) of the observations have high consumption and are accurately predicted as such. The remainder were predicted incorrectly (they have high use but were predicted as low, or vice versa).
In short, approx. 78.8 % of the observations were predicted correctly and 21.2 % (0.2120419) were predicted incorrectly. Visualised, the data appears as follows:
Suppose that we had not fitted a logistic regression model but simply guessed: the number of correct guesses would likely differ from (ideally be lower than) the number of correct predictions made by the model. If we assumed none of the students to have a high alcohol consumption rate (probability = 0), our error rate would be the following:
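A sketch of this comparison, using a loss function that computes the mean of incorrect predictions (the exact implementation is an assumption):

# a prediction counts as wrong when the class (0/1) and the
# predicted probability differ by more than 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = 0)  # guess "low use" for everyone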
## [1] 0.2984293
We shall also test the predictive power of a 50/50 guessing method where the probability for a student to have a high alcohol consumption rate is assumed to be 0.5 (i.e. the rate is as likely to be high as to be low). We calculate the mean of incorrect guesses as:
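With the same loss function:

loss_func(class = alc$high_use, prob = 0.5)  # 50/50 guess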
## [1] 0
Interestingly, our calculation indicates that the mean of incorrect guesses is 0, i.e. that all guesses are correct. I find this somewhat suspicious; a likely explanation is that a loss function like the one sketched above counts a guess as wrong only when the absolute difference between the class (0 or 1) and the probability is strictly greater than 0.5, and at a probability of exactly 0.5 that difference equals 0.5 for every observation, so no guess is counted as incorrect.
Lastly, we shall use the so-called 10-fold validation to assess the accuracy of the model fitted above. This method is a form of cross-validation and the main goal is therefore to validate the model on an independent data set that has not been used for training the model. In 10-fold validation, the data is split into 10 parts, from which 9 parts are used for training, and the remaining 1 for testing. The process is then repeated with a different part functioning as testing set. The mean prediction error is as follows:
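A sketch of the cross-validation, using cv.glm() from the boot package with the loss function above as the cost function:

library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]  # mean prediction error across the 10 folds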
## [1] 0.2172775
In summary, the prediction error of the model on test data (0.2172775) is very close to the training error of our model (0.2120419). The prediction error is also lower than that of a similar model tested in the DataCamp exercises leading up to this analysis, where the explanatory variables were sex, absences and failures in classes.
In this analysis, we apply clustering and classification methods to study a dataset. Based on a model fitted to training data, we are also able to classify previously unseen test data.
The data we shall use for this analysis comes from the R package “MASS” and represents housing values in Boston in the 1970s. A quick overview of the variables is provided below, where the variables stand for the following (Source):
crim: per capita crime rate by town.
zn: proportion of residential land zoned for lots over 25,000 sq.ft.
indus: proportion of non-retail business acres per town.
chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox: nitrogen oxides concentration (parts per 10 million).
rm: average number of rooms per dwelling.
age: proportion of owner-occupied units built prior to 1940.
dis: weighted mean of distances to five Boston employment centres.
rad: index of accessibility to radial highways.
tax: full-value property-tax rate per $10,000.
ptratio: pupil-teacher ratio by town.
black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat: lower status of the population (percent).
medv: median value of owner-occupied homes in $1000s.
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The variable rad and the dummy variable chas are the only integer variables; the rest are numeric.
Before we begin our analysis, we shall standardise the dataset: since the methods below assume normally distributed variables, we adjust the numeric values so that each variable has mean = 0, with all values expressing the distance from the mean in units of standard deviation. The data now looks as follows (cf. the table above):
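A sketch of the standardisation:

library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))  # (x - mean) / sd for each column
summary(boston_scaled)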
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Before continuing, we shall modify the variable crim into a categorical factor variable based on its quantiles, seen above as its values for Min, 1st Qu., Median, 3rd Qu., and Max. The variable (renamed crime) now has the following levels and numbers of observations per level:
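A sketch of the recoding:

bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)  # replace crim with the factor
boston_scaled$crime <- crime
table(crime)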
##
## low med_low med_high high
## 127 126 126 127
We shall also divide the dataset into a training and a testing set. The training set consists of a random sample of 80 % of the whole dataset, the testing set of the remaining 20 %. A summary of the testing set is shown below.
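A sketch of the split:

n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)  # random 80 % of the row indices
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]
summary(test)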
## zn indus chas
## Min. :-0.48724 Min. :-1.51549 Min. :-0.27233
## 1st Qu.:-0.48724 1st Qu.:-0.91166 1st Qu.:-0.27233
## Median :-0.48724 Median :-0.21089 Median :-0.27233
## Mean : 0.06743 Mean :-0.01142 Mean : 0.03646
## 3rd Qu.:-0.08527 3rd Qu.: 1.01499 3rd Qu.:-0.27233
## Max. : 3.58609 Max. : 2.42017 Max. : 3.66477
## nox rm age
## Min. :-1.42991 Min. :-1.96641 Min. :-2.08800
## 1st Qu.:-0.86682 1st Qu.:-0.57839 1st Qu.:-0.85705
## Median :-0.23037 Median :-0.22578 Median : 0.42364
## Mean : 0.06059 Mean : 0.03899 Mean : 0.02583
## 3rd Qu.: 0.93249 3rd Qu.: 0.49723 3rd Qu.: 0.85972
## Max. : 2.72965 Max. : 3.47325 Max. : 1.11639
## dis rad tax
## Min. :-1.2658165 Min. :-0.98187 Min. :-1.30676
## 1st Qu.:-0.8095571 1st Qu.:-0.63733 1st Qu.:-0.75940
## Median :-0.4617172 Median :-0.52248 Median :-0.37521
## Mean : 0.0004225 Mean : 0.05625 Mean : 0.02809
## 3rd Qu.: 0.7440043 3rd Qu.: 1.65960 3rd Qu.: 1.52941
## Max. : 2.5764502 Max. : 1.65960 Max. : 1.79642
## ptratio black lstat medv
## Min. :-2.70470 Min. :-3.8792 Min. :-1.4260 Min. :-1.90634
## 1st Qu.:-0.46446 1st Qu.: 0.2353 1st Qu.:-0.7636 1st Qu.:-0.54178
## Median : 0.25149 Median : 0.3883 Median :-0.1111 Median :-0.11773
## Mean : 0.01058 Mean : 0.1411 Mean : 0.0495 Mean : 0.09205
## 3rd Qu.: 0.80578 3rd Qu.: 0.4406 3rd Qu.: 0.5986 3rd Qu.: 0.27641
## Max. : 1.26768 Max. : 0.4406 Max. : 2.5426 Max. : 2.98650
## crime
## low :28
## med_low :24
## med_high:21
## high :29
##
##
Linear Discriminant Analysis (LDA) is a classification method that assumes continuous, normally distributed explanatory variables, such as those in our scaled dataset. Using crime as our target variable, we shall fit an LDA on the training set so as to later test it on the testing set.
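A sketch of the fit, matching the Call: below:

lda.fit <- lda(crime ~ ., data = train)
lda.fit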
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2450495 0.2524752 0.2599010 0.2425743
##
## Group means:
## zn indus chas nox rm
## low 0.86772064 -0.9117255 -0.113254311 -0.8559192 0.43722815
## med_low -0.05090227 -0.3336183 -0.002135914 -0.6209215 -0.09659554
## med_high -0.37943482 0.2454243 0.177625245 0.3913831 0.05252999
## high -0.48724019 1.0171960 -0.111631099 1.0285199 -0.43801498
## age dis rad tax ptratio
## low -0.8761555 0.8224485 -0.6825737 -0.7404465 -0.4516307
## med_low -0.4178903 0.4167778 -0.5495071 -0.4866089 -0.1076168
## med_high 0.4423203 -0.3907026 -0.4054501 -0.2690322 -0.2077740
## high 0.8192450 -0.8464597 1.6373367 1.5134896 0.7798552
## black lstat medv
## low 0.38004847 -0.769773881 0.51833638
## med_low 0.30284640 -0.166041183 0.01295565
## med_high 0.08102127 -0.005963686 0.10733745
## high -0.93285252 0.905313017 -0.74792289
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11204850 0.486900816 -1.11026555
## indus 0.02430163 -0.360159169 0.29034052
## chas -0.11344102 -0.006334850 0.20103049
## nox 0.44918572 -0.898793742 -1.24339451
## rm -0.06031348 -0.056660379 -0.10599577
## age 0.25577131 -0.387833245 -0.16624840
## dis -0.10488040 -0.225491698 0.31323980
## rad 2.95398512 0.932620317 -0.20078859
## tax -0.02847357 0.127736250 0.61568770
## ptratio 0.12858149 -0.091210558 -0.34358310
## black -0.13269809 -0.004660517 0.07358266
## lstat 0.17834616 -0.057512725 0.51580490
## medv 0.16039606 -0.377912300 -0.15299087
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9418 0.0428 0.0154
Based on the above, the 1st linear discriminant (LD1) explains as much as 94.2 % of the between-group variance, LD2 explaining 4.3 % and LD3 1.5 %. Visualising the observations across LD1 and LD2, the data looks as follows:
Based on the classification data of the LDA above, we shall classify the observations of the test data, i.e. we are testing the predictive accuracy of the LDA on data not part of the model’s training data. The results of the prediction are as follows:
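A sketch of the prediction step, assuming the LDA fit lda.fit and the test set from above:

correct_classes <- test$crime             # save the true classes
test_pred <- dplyr::select(test, -crime)  # remove them from the predictors
lda.pred <- predict(lda.fit, newdata = test_pred)
table(correct = correct_classes, predicted = lda.pred$class)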
## predicted
## correct low med_low med_high high
## low 18 7 3 0
## med_low 0 14 10 0
## med_high 0 7 14 0
## high 0 0 0 29
Out of a total of 102 observations, the number of incorrectly classified observations is 27 (7 + 3 + 10 + 7), i.e. 26.5 %.
Here we shall investigate the distances between observations. To do this, we begin with a fresh dataset, i.e. the original Boston dataset prior to the modifications described above. We standardise the dataset and compute distance matrices between the observations:
data('Boston')
boston_scaled <- scale(Boston)
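The distances themselves can then be computed with dist(); a sketch matching the summaries below:

dist_eu <- dist(boston_scaled, method = "euclidean")
summary(dist_eu)
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_man)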
## [1] "Euclidean distance"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.620 170.500 226.300 371.900 626.000
## [1] "Manhattan distance"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2849 8.8390 13.0300 13.8700 18.1000 46.8900
Let us visualise the k-means clustering of the dataset. Here the optimal number of clusters is determined by looking at the total within-cluster sum of squares (WCSS): the optimal number is found where the line in a line plot drops most rapidly.
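A sketch, assuming the rescaled boston_scaled from above (k-means is stochastic, so a seed is set for reproducibility):

set.seed(123)  # an assumed seed
twcss <- sapply(1:10, function(k) kmeans(boston_scaled, k)$tot.withinss)
plot(1:10, twcss, type = "b", xlab = "number of clusters", ylab = "total WCSS")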
Based on the above, we select 2 as the optimal number of clusters. Visualised, the clusters appear as follows.
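A sketch of the clustering and plot:

km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[, 6:10], col = km$cluster)  # a subset of columns for readability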
We shall create a 3D plot of the training data across the three linear discriminants, setting the colour to represent the crime classes.
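A sketch of the 3D plot, assuming the LDA fit lda.fit and the training set from the classification section; the matrix product projects the observations onto the linear discriminants:

library(plotly)
model_predictors <- dplyr::select(train, -crime)
matrix_product <- as.data.frame(as.matrix(model_predictors) %*% lda.fit$scaling)
plot_ly(matrix_product, x = ~LD1, y = ~LD2, z = ~LD3,
        type = "scatter3d", mode = "markers", color = train$crime)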
Looking at the same model with color signifying the clusters:
Here low and med_low are merged into cluster 1, whereas high and med_high make up the second cluster. A 3D view illustrates the heterogeneity of the clusters: cluster 1 has a limited range of y-values (LD2) but a wide range of x-values (LD1), whereas for cluster 2 the opposite is true. Both clusters spread along the z-axis (LD3). However, it should be noted that LD3 had the smallest proportion of trace and that of LD2 was also quite low. In light of this, the plotting seems appropriate enough.
Because analysing and visualising tens of dimensions is much more difficult, it is often useful to reduce the number of dimensions by merging correlated variables together. In this analysis we conduct a Principal Component Analysis and a Multiple Correspondence Analysis.
The data we use here was created as part of the UN’s Human Development Programme, reflecting the Human Development Index (HDI), Gross National Income, and variables such as the expected education level or percentage of women in labor in different countries. For an overview, the variables are:
## 'data.frame': 155 obs. of 8 variables:
## $ eduratio: num 1.007 0.997 0.983 0.989 0.969 ...
## $ labratio: num 0.891 0.819 0.825 0.884 0.829 ...
## $ exped : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ lifexp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ gni : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ matmor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ adbirth : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ parlperc: num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
i.e.
eduratio: ratio of females to males with at least secondary education.
labratio: ratio of females to males participating in the labour force.
exped: expected years of education.
lifexp: life expectancy at birth.
gni: Gross National Income.
matmor: maternal mortality ratio.
adbirth: adolescent birth rate.
parlperc: percentage of female representatives in parliament.
Note that any countries with missing values for any variables have been deleted from the present data.
First we shall conduct a Principal Component Analysis (PCA) using the same data as above with non-standardized variables.
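A sketch of the PCA and biplot, assuming the data frame is named human:

pca_human <- prcomp(human)  # prcomp centres but does not rescale by default
biplot(pca_human, choices = 1:2, cex = c(0.6, 0.8), col = c("grey40", "deeppink2"))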
As seen in the biplot visualisation above, GNI clearly stands out from the other variables. However, this is due to the variable being numerically on a completely different scale than the others (tens of thousands vs. ratios below 1 or percentages below 100). This makes the correlations unreliable (only outliers show), not to mention difficult to analyse. The data must thus be scaled.
To amend the abovementioned problem, we conduct a PCA on a standardised version of the data, where each variable has mean = 0 and standard deviation = 1. The results are visualised using PC1 and PC2, along with the percentages of explained variance for both principal components. Together the two components explain 69.8 % of the variance.
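A sketch of the standardised PCA, including the variance percentages printed below:

human_std <- scale(human)
summary(human_std)
pca_human_std <- prcomp(human_std)
pca_pct <- round(100 * summary(pca_human_std)$importance[2, ], digits = 1)
paste0(names(pca_pct), " (", pca_pct, "%)")  # axis labels with explained variance
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 0.8), col = c("grey40", "deeppink2"))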
## eduratio labratio exped lifexp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## gni matmor adbirth parlperc
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
## [1] "PC1 (53.6%)" "PC2 (16.2%)" "PC3 (9.6%)" "PC4 (7.6%)" "PC5 (5.5%)"
## [6] "PC6 (3.6%)" "PC7 (2.6%)" "PC8 (1.3%)"
From this biplot we can see that the expected level of education correlates, as expected, with the ratio of percentages of females vs. males with secondary education, as well as with gni and life expectancy at birth. Further, these four variables correlate negatively with maternal mortality and adolescent birth rate. We also see that the ratio of percentages of females vs. males in the labour force correlates with the percentage of female representatives in parliament.
I interpret the two principal components identified above as follows:
Along PC1 (53.6 % of the variance) we have variables that represent life and mortality, but also education and gni. Meanwhile, PC2 (16.2 % of the variance) represents two variables tied to the representation of women in working life and politics. Female representation alone does not guarantee low maternal mortality or adolescent birth rate (cf. Mozambique in the biplot above), but high education levels or gni likewise do not guarantee equal representation (cf. Iran or Qatar).
Here we shall switch datasets and use data from the FactoMineR package describing tea consumption and how tea is usually enjoyed. We shall analyse the data through Multiple Correspondence Analysis (MCA). The dataset has 36 variables in total, but here we are interested in who the respondents are and how they take their tea. Our variables are: “SPC” (socio-professional category), “frequency”, “how” (packaging), “How” (additions such as milk) and “sex”. The results appear as follows:
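A sketch of the variable selection and the fit, matching the Call: below:

library(FactoMineR)
data("tea")
tea2 <- dplyr::select(tea, SPC, frequency, how, How, sex)
mca <- MCA(tea2)  # also draws the default MCA plots
summary(mca)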
##
## Call:
## MCA(X = tea2)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.293 0.263 0.248 0.243 0.233 0.221
## % of var. 9.772 8.766 8.268 8.085 7.774 7.357
## Cumulative % of var. 9.772 18.537 26.806 34.891 42.665 50.021
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.208 0.197 0.191 0.183 0.166 0.159
## % of var. 6.936 6.553 6.382 6.103 5.517 5.306
## Cumulative % of var. 56.957 63.509 69.891 75.994 81.511 86.817
## Dim.13 Dim.14 Dim.15
## Variance 0.139 0.129 0.128
## % of var. 4.618 4.312 4.253
## Cumulative % of var. 91.435 95.747 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | 0.683 0.530 0.204 | 0.325 0.133 0.046 | -0.369 0.183
## 2 | 0.480 0.262 0.083 | -0.147 0.027 0.008 | -0.360 0.174
## 3 | -0.240 0.065 0.017 | -0.551 0.385 0.087 | -0.611 0.501
## 4 | 0.231 0.061 0.033 | -0.259 0.085 0.041 | -0.370 0.184
## 5 | -0.033 0.001 0.001 | -0.047 0.003 0.001 | 0.060 0.005
## 6 | 0.231 0.061 0.033 | -0.259 0.085 0.041 | -0.370 0.184
## 7 | 0.665 0.502 0.122 | 0.526 0.351 0.076 | 0.502 0.339
## 8 | 0.381 0.165 0.041 | 0.080 0.008 0.002 | -0.739 0.733
## 9 | 0.712 0.576 0.155 | -0.159 0.032 0.008 | 1.220 2.002
## 10 | 0.359 0.147 0.049 | 0.183 0.042 0.013 | 1.131 1.720
## cos2
## 1 0.060 |
## 2 0.047 |
## 3 0.107 |
## 4 0.084 |
## 5 0.002 |
## 6 0.084 |
## 7 0.069 |
## 8 0.156 |
## 9 0.456 |
## 10 0.488 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## employee | -0.325 1.421 0.026 -2.785 | -0.196 0.572 0.009 -1.673 |
## middle | 0.472 2.029 0.034 3.203 | 1.314 17.501 0.266 8.910 |
## non-worker | -0.254 0.942 0.018 -2.291 | -0.038 0.023 0.000 -0.341 |
## other worker | 0.618 1.740 0.027 2.858 | -1.156 6.772 0.095 -5.341 |
## senior | 1.211 11.680 0.194 7.612 | 0.080 0.057 0.001 0.506 |
## student | -0.751 8.985 0.172 -7.167 | -0.182 0.585 0.010 -1.731 |
## workman | 1.201 3.939 0.060 4.240 | -0.465 0.659 0.009 -1.643 |
## 1/day | 0.611 8.075 0.173 7.196 | -0.520 6.515 0.125 -6.122 |
## 1 to 2/week | 0.344 1.184 0.020 2.466 | 0.064 0.045 0.001 0.457 |
## +2/day | -0.529 8.080 0.205 -7.837 | 0.038 0.046 0.001 0.558 |
## Dim.3 ctr cos2 v.test
## employee 0.375 2.228 0.034 3.207 |
## middle -0.369 1.466 0.021 -2.504 |
## non-worker -0.188 0.609 0.010 -1.694 |
## other worker -1.094 6.428 0.085 -5.054 |
## senior 1.486 20.786 0.292 9.341 |
## student -0.374 2.626 0.042 -3.564 |
## workman 0.058 0.011 0.000 0.204 |
## 1/day -0.034 0.029 0.001 -0.395 |
## 1 to 2/week -0.976 11.275 0.164 -7.000 |
## +2/day 0.289 2.845 0.061 4.278 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## SPC | 0.451 0.344 0.424 |
## frequency | 0.258 0.258 0.184 |
## how | 0.063 0.547 0.521 |
## How | 0.149 0.139 0.102 |
## sex | 0.546 0.027 0.010 |
We need up to six dimensions to explain even 50 % of the variance. Nevertheless, it appears for example that the middle class favours unpackaged (loose leaf) tea more than the other social groups do. Further, they do not drink tea with milk as much as the working class. (It is known in sociology that the middle class tends to distinguish itself from the lower classes, but that its practices differ from those of the upper classes.) A concern about the data is that student and female appear strongly correlated, whereas senior correlates with male. This could indicate a sample where one sex is overrepresented within a group. However, investigating the data further falls outside the scope of the current analysis.
This week we apply the methods learned so far to longitudinal data, i.e. data collected over a longer time period, with several observations per subject.
The data used in this exercise, based on data by Kimmo Vehkalahti found here, is from a nutrition study on three groups of rats, each rat’s weight being recorded approximately once a week for 9 weeks. That is, the data is longitudinal data tracking the individuals over a time period. The goal is to compare the trends of each group of rats. The data has been wrangled to a long (as opposed to wide) format.
Simply put, the variables are as follows:
## ID Group WD Weight Time
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
For a graphical overview, these are the trends over the 9-week time period across the three groups:
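A sketch of the growth-curve plot, assuming the long-form data is in a data frame named RATSL:

library(ggplot2)
ggplot(RATSL, aes(x = Time, y = Weight, group = ID)) +
  geom_line() +
  facet_grid(. ~ Group, labeller = label_both) +
  theme_bw()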
The rats in the first group weigh noticeably less than those in the other groups, whereas by the end of the experiment the rats in Group 2 seem to weigh close to what the rats in Group 3 weighed at the beginning. Group 2 also includes what seems to be an outlier. With standardised weights, the data appears as follows, allowing for a closer tracking of individuals:
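A sketch of the standardisation, computed within each time point:

library(dplyr)
RATSL <- RATSL %>%
  group_by(Time) %>%
  mutate(stdweight = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup()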
To see the forest for the trees - or the groups for the individuals - it is helpful to plot the data in a way that shows the groups, not individuals.
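A sketch of such a group-level plot: mean weight per group and time point, with standard-error bars:

RATSS <- RATSL %>%
  group_by(Group, Time) %>%
  summarise(mean = mean(Weight), se = sd(Weight) / sqrt(n())) %>%
  ungroup()
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  geom_point() +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se), width = 0.3) +
  theme_bw()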
We shall now plot the summary measures, i.e. single values that represent all the observations of an individual. For this we use only the observations after the first measurement (Day 1).
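A sketch of the summary measure (each rat's mean weight after Day 1) and its boxplot:

RATSL8 <- RATSL %>%
  filter(Time > 1) %>%
  group_by(Group, ID) %>%
  summarise(mean = mean(Weight)) %>%
  ungroup()
ggplot(RATSL8, aes(x = Group, y = mean)) +
  geom_boxplot() +
  theme_bw()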
There are outliers in all groups, so to reduce bias we remove them: the highest value from Group 2 and the lowest values from Groups 1 and 3. We repeat the plot without these outliers:
Now the groups appear more obviously different. Group 1 is noticeably lighter in weight than the two other groups, and the weights of the rats in Group 3 are very close to each other. We shall verify the difference between Groups 2 and 3 using Student’s t-test.
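A sketch of the test, assuming the outlier-free summary data is in RATSL8a (a hypothetical name):

t.test(mean ~ Group, data = droplevels(filter(RATSL8a, Group != 1)),
       var.equal = TRUE)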
##
## Two Sample t-test
##
## data: mean by Group
## t = -18.235, df = 4, p-value = 5.32e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -98.94088 -72.79246
## sample estimates:
## mean in group 2 mean in group 3
## 452.4000 538.2667
The p-value being 5.32e-05 confirms that these Groups indeed differ from each other in a statistically significant way.
BPRS (the Brief Psychiatric Rating Scale) is a measure of behaviour and psychiatric symptoms (Wikipedia). In this dataset, available here, male subjects from two treatment groups have been tested and rated on the BPRS in one initial test (week 0) plus eight weekly measurements. This, too, is a longitudinal dataset that has been wrangled into long form. The data looks like this:
## treatment subject weeks bprs week
## 1 1 1 week0 42 0
## 2 1 2 week0 58 0
## 3 1 3 week0 54 0
## 4 1 4 week0 55 0
## 5 1 5 week0 72 0
## 6 1 6 week0 48 0
# Line plot of BPRS over time, one panel per treatment group
ggplot(BPRS, aes(x = week, y = bprs, linetype = subject, fill = treatment)) +
  geom_line() +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_x_continuous(name = "Week no.") +
  scale_y_continuous(name = "BPRS") +
  ggtitle("BPRS over time across 2 treatment groups") +
  theme_bw() + theme(legend.position = "none")
# The same individuals in one panel, labelled by treatment group
ggplot(BPRS, aes(x = week, y = bprs, group = subject, color = treatment)) +
  geom_text(aes(label = treatment), position = position_dodge(width = 0.3)) +
  theme_bw() + ggtitle("BPRS of individuals in treatment groups")
In the first treatment group, the BPRS decreases over time for most individuals, whereas in the second group the scores continue to vary over the course of the testing period.
We shall fit a linear regression model for the data using the variables treatment and week as explanatory variables, and bprs as the target variable.
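A sketch of the fit, matching the Call: below:

BPRS_reg <- lm(bprs ~ week + treatment, data = BPRS)
summary(BPRS_reg)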
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Based on this model, treatment is not a statistically significant variable. However, it should be noted that for longitudinal data a linear model is quite naive, even false, because it does not track individuals over time: it treats the repeated observations as independent of each other, which does not correspond to the reality of the situation. Thus, using the same variables as above, we shall also fit a random intercept model, which does take the longitudinal aspect of the data into account and assumes heterogeneity in intercepts.
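A sketch of the random intercept model, using lmer() from the lme4 package and fitted by maximum likelihood as in the output below:

library(lme4)
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
summary(BPRS_ref)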
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
Below is a random slope model for the data, which assumes heterogeneity in slopes, followed by an analysis of variance (ANOVA) comparing the random intercept model above with the random slope model.
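A sketch of the model and the comparison:

BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
summary(BPRS_ref1)
anova(BPRS_ref, BPRS_ref1)  # likelihood ratio test between the two models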
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8202 8.0511
## week 0.9608 0.9802 -0.51
## Residual 97.4307 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
##
##
## ANOVA test
## Data: BPRS
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA results indicate that the two models differ from each other in a statistically significant way (p < 0.05), i.e. the random slope model improves the fit. We shall thus fit a random intercept and slope model where, instead of keeping week and treatment separate, we allow for a week * treatment interaction. We conduct ANOVAs to compare this with both of the models above.
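A sketch of the interaction model and the two comparisons:

BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRS, REML = FALSE)
summary(BPRS_ref2)
anova(BPRS_ref, BPRS_ref2)   # vs. the random intercept model
anova(BPRS_ref1, BPRS_ref2)  # vs. the random slope model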
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRS
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0767 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 65.0016 8.0624
## week 0.9688 0.9843 -0.51
## Residual 96.4699 9.8219
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2522 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
##
##
## ANOVA test with random intercept model
## Data: BPRS
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 10.443 3 0.01515 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## ANOVA test with random slope model
## Data: BPRS
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Evidently the difference from the random intercept model is statistically significant, but the difference from the random slope model less so (though still p < 0.1).
Finally, we shall draw a plot that shows the fitted values of the random intercept and slope model.
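A sketch of the plot, using the fitted values of the interaction model:

BPRS$fitted <- fitted(BPRS_ref2)
ggplot(BPRS, aes(x = week, y = fitted, group = subject)) +
  geom_line() +
  facet_grid(. ~ treatment, labeller = label_both) +
  scale_y_continuous(name = "Fitted BPRS") +
  theme_bw()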
The difference is minimal even in the visualisation, and indeed Vehkalahti & Everitt (2019: Chapter 8) confirm that the difference between the two treatment groups is not statistically significant.